Skip to content

Improve to_datetime bounds checking #50183

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jan 20, 2023

Conversation

rebecca-palmer
Copy link
Contributor

@rebecca-palmer rebecca-palmer commented Dec 11, 2022

to_datetime(errors='raise') is supposed to raise an exception on out-of-bounds input. However, it uses rounding to integer for this bounds checking but not the actual conversion, and hence, for input just outside the bounds it may instead return NaT (on x86) or output clipped to the bounds (on arm).

It also assumes that converting NaN to int gives NaT, which may not be true on unusual hardware. (This was how I originally noticed the problem, as Debian tries to build packages on a wide range of hardware.)

This patch fixes both issues. (In the Debian logs linked here, search for 'near-limits test', the first run is unpatched 1.3.5, the second run is patched 1.5.x.)

Its effect on speed is unclear: it is sometimes faster and sometimes slower, possibly because other load on the build machines varies.

  • [this bug does not appear to be previously reported ] closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • [N/A ] Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

result2=to_datetime(should_succeed, unit="Y", errors='coerce')
result3=to_datetime(should_succeed, unit="Y", errors='ignore')
tm.assert_series_equal(result1, result2)
tm.assert_series_equal(result1, result3)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A check that compared to a non to_datetime correct value would be better, but there don't seem to be any "allclose" checks in this file, and using exact equality risks rounding error failing it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. (Possibly the reason there weren't any allclose assert_almost_equal here was that that doesn't work on datetime arrays: pandas/_testing/asserters.py:731 calls assert_numpy_array_equal, which doesn't have an atol/rtol.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reported that as #50191

@MarcoGorelli MarcoGorelli self-requested a review December 11, 2022 17:11
@rebecca-palmer rebecca-palmer marked this pull request as ready for review December 12, 2022 07:28
@mroeschke mroeschke added the Datetime Datetime data dtype label Dec 13, 2022
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I've understood correctly:

Before:

  • values would get cast to i8 (thus losing some precision, e.g. 292.32702463 -> 292)
  • the i8 cast would then get cast back to f8, but once this has been done, the result could be smaller than what you started with, because of the loss of precision
  • the bounds check might (erroneously) not pass

Whereas now:

  • values are multiplied directly with mult
  • as there's no loss of precision, the out-of-bounds check no longer erroneously passes

That's one change. The other change is to use values != values in the mask, as was done in #48705. Perhaps let's merge that one first, then rebase this one?

I've just left a question, as one change looks good but orthogonal (it's fine, just checking)

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#48705 is in - could you rebase please and we can get this in too?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 13, 2023
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Jan 18, 2023
@MarcoGorelli
Copy link
Member

I'll add a commit here when I get a chance (hopefully early next week), I think we should ship this, and @rebecca-palmer did some seriously good work here

@mroeschke mroeschke reopened this Jan 18, 2023
@rebecca-palmer
Copy link
Contributor Author

Fixed conflicts (which was quite a large change, so wait for the CI before merging - as well as #48705 having done part of this, #50263 moved the code being patched to a non-Cython context, and #50301 forbids the case I was using as a test).

Unrelated to my changes, but it looks like #50263 may have accidentally undone the effect of #50301 for some input types - @jbrockmendel ?

@jbrockmendel
Copy link
Member

Unrelated to my changes, but it looks like #50263 may have accidentally undone the effect of #50301 for some input types - @jbrockmendel ?

That's entirely plausible. Would be happy to see a PR fixing that.

@rebecca-palmer
Copy link
Contributor Author

That CI fail looks like the same one main then had - retrying.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - only minor comments would be to add a whatsnew note in the 2.0.0 bug fixes section, and ideally to have a link to the github issue (or to the pull request if there's no issue) in the test (check some other tests for examples)

@rebecca-palmer
Copy link
Contributor Author

Fixed, but also removed the unnecessary copy, so wait for the CI to check that it really is unnecessary.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me pending green, thanks @rebecca-palmer !

@rebecca-palmer
Copy link
Contributor Author

Does the Windows CI timing out suggest there's a hang bug in this, or does that just happen sometimes?

(Numpy Dev also fails on main, for reasons that are nothing to do with pandas: numpy/numpy#23033)

@mroeschke mroeschke removed the Stale label Jan 20, 2023
@mroeschke mroeschke added this to the 2.0 milestone Jan 20, 2023
@mroeschke mroeschke merged commit 0b04174 into pandas-dev:main Jan 20, 2023
@mroeschke
Copy link
Member

Thanks @rebecca-palmer

pooja-subramaniam pushed a commit to pooja-subramaniam/pandas that referenced this pull request Jan 25, 2023
* add test for float to_datetime near overflow bounds

* fix float to_datetime near overflow bounds

* fix typo and formatting

* fix formatting

* fix test to not fail on rounding differences

* don't use approximate comparison on datetimes, it doesn't work

* also can't convert datetime to float

* match dtypes

* TST: don't try to use non-integer years (see pandas-dev#50301)

* TST: don't cross an integer

(tsmax_in_days happens to be close to an integer,
and this is a test of rounding)

* PERF: remove unnecessary copy

* add whatsnew
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this pull request Jan 26, 2023
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?)
Don't round to int for the bounds check when we don't for the real
conversion (wrong near the bounds, and maybe also a waste of time)

Author: Rebecca N. Palmer <[email protected]>
Forwarded: pandas-dev/pandas#50183


Gbp-Pq: Name float_to_datetime.patch
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this pull request Jan 26, 2023
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?)
Don't round to int for the bounds check when we don't for the real
conversion (wrong near the bounds, and maybe also a waste of time)

Author: Rebecca N. Palmer <[email protected]>
Forwarded: pandas-dev/pandas#50183


Gbp-Pq: Name float_to_datetime.patch
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this pull request Feb 2, 2023
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?)
Don't round to int for the bounds check when we don't for the real
conversion (wrong near the bounds, and maybe also a waste of time)

Author: Rebecca N. Palmer <[email protected]>
Forwarded: pandas-dev/pandas#50183


Gbp-Pq: Name float_to_datetime.patch
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this pull request Mar 1, 2023
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?)
Don't round to int for the bounds check when we don't for the real
conversion (wrong near the bounds, and maybe also a waste of time)

Author: Rebecca N. Palmer <[email protected]>
Forwarded: pandas-dev/pandas#50183


Gbp-Pq: Name float_to_datetime.patch
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this pull request Dec 19, 2023
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?)
Don't round to int for the bounds check when we don't for the real
conversion (wrong near the bounds, and maybe also a waste of time)

Author: Rebecca N. Palmer <[email protected]>
Forwarded: pandas-dev/pandas#50183


Gbp-Pq: Name float_to_datetime.patch
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this pull request Dec 19, 2023
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?)
Don't round to int for the bounds check when we don't for the real
conversion (wrong near the bounds, and maybe also a waste of time)

Author: Rebecca N. Palmer <[email protected]>
Forwarded: pandas-dev/pandas#50183


Gbp-Pq: Name float_to_datetime.patch
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this pull request Feb 6, 2024
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?)
Don't round to int for the bounds check when we don't for the real
conversion (wrong near the bounds, and maybe also a waste of time)

Author: Rebecca N. Palmer <[email protected]>
Forwarded: pandas-dev/pandas#50183


Gbp-Pq: Name float_to_datetime.patch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants